Univariate exploratory data analysis

GEOG 30323

September 22, 2015

Time for data!

Source: bigdatapix.tumblr.com

The data analysis process

Adapted from H. Wickham

Exploratory data analysis

  • “Detective work” to summarize and explore datasets

Includes:

  • Data acquisition and input
  • Data cleaning and wrangling (“tidying”)
  • Data transformation and summarization
  • Data visualization

Your core Python tools for EDA: NumPy, pandas, and seaborn/matplotlib

NumPy

  • Extension to Python; the core Python package for numerical computing
  • Standard import: import numpy as np
  • Data structure: the NumPy array. Similar to a list, but it has more methods, supports element-wise arithmetic, and can be multidimensional
import numpy as np

y = np.array([[2, 4, 6, 8, 10, 12], 
             [1, 3, 5, 7, 9, 11], 
             [10, 12, 14, 18, 22, 14], 
             [9, 3, 3, 3, 3, 1]])
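As a brief sketch of what "more methods" and "multidimensional" mean in practice, the array above can be inspected and summarized along either axis. The variable names other than y are illustrative only.

```python
import numpy as np

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

shape = y.shape              # (rows, columns) -> (4, 6)
n_dims = y.ndim              # number of dimensions -> 2
col_means = y.mean(axis=0)   # one mean per column
row_sums = y.sum(axis=1)     # one sum per row
doubled = y * 2              # arithmetic is applied element by element

print(shape, n_dims)
print(col_means)
print(row_sums)
```

Multiplying a plain Python list by 2 would repeat the list instead; the element-wise behavior is one of the main reasons to use arrays for numerical work.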
            

Pandas

  • Built on top of NumPy; adds support for table-like data structures in Python
  • Standard import: import pandas as pd
  • Sequences of data are stored as Series objects, which collectively form DataFrames
import pandas as pd

df = pd.DataFrame(y, columns=['x' + str(num) for num in range(1, 7)])
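To see the relationship between Series and DataFrames, the sketch below rebuilds the same data frame and pulls out one column. It assumes only the y array and column names used above; first_column is an illustrative name.

```python
import numpy as np
import pandas as pd

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

# Column names x1 through x6
df = pd.DataFrame(y, columns=['x' + str(num) for num in range(1, 7)])

first_column = df['x1']              # a single column is a Series
print(type(first_column).__name__)   # Series
print(df.shape)                      # (4, 6)
print(df.head(2))                    # first two rows of the table
```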

The pandas DataFrame

  • Commonly, DataFrames are created by reading in external data, like CSV files
# To read in CSV files, we use the pd.read_csv function 

grad = pd.read_csv('grad_rates.csv')

The pandas DataFrame

  • Each observation forms a row, defined by an index; attributes of those observations are found in the columns of the DataFrame

  • Columns are accessible by name with square brackets, e.g. grad['rate'], or as attributes of the data frame, e.g. grad.rate
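A minimal sketch of both access styles follows. The contents of grad_rates.csv are not shown in these slides, so the small data frame here is a hypothetical stand-in with made-up school and rate columns.

```python
import pandas as pd

# Hypothetical stand-in for grad_rates.csv; values are invented for illustration
grad = pd.DataFrame({'school': ['A', 'B', 'C'],
                     'rate': [61.5, 74.0, 88.2]})

by_label = grad['rate']   # bracket access works for any column name
by_attr = grad.rate       # attribute access returns the same Series
row = grad.loc[1]         # one observation (row), selected by its index label

print(by_label.equals(by_attr))
print(row['school'])
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame method or attribute, so bracket access is the safer default.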

Levels of measurement

  • Nominal: qualitative, descriptive, categories
  • Ordinal: ordering or ranking; however, no information about distance between ranks
  • Interval: additive; no natural zero (zero is an arbitrary point on the scale, not an absence of the quantity, e.g. 0 °C)
  • Ratio: multiplicative; natural zero (zero means an absence of a value)

Make sure you know your column types (dtypes) and levels of measurement before doing analysis!
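One way to check column types is the dtypes attribute. The sketch below uses a hypothetical survey table (all names and values invented) with one column per level of measurement, and shows that an ordinal column can be made explicit with an ordered categorical.

```python
import pandas as pd

# Hypothetical data: one column per level of measurement
survey = pd.DataFrame({
    'region': ['North', 'South', 'South'],   # nominal
    'rating': ['low', 'high', 'medium'],     # ordinal
    'temp_c': [21.5, 30.1, 27.8],            # interval
    'enrollment': [1200, 450, 980],          # ratio
})

print(survey.dtypes)   # text columns vs. float vs. integer

# pandas cannot infer the ranking of text labels, so state it explicitly
survey['rating'] = pd.Categorical(survey['rating'],
                                  categories=['low', 'medium', 'high'],
                                  ordered=True)
highest = survey['rating'].max()   # uses the declared order, not the alphabet
print(highest)
```

Note that dtypes describe storage, not meaning: a ZIP code stored as an integer is still nominal, so the level of measurement has to come from your knowledge of the data.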

Measures of central tendency

  • Mode: the most frequently occurring value in a distribution
  • Median: the middle value of a sorted distribution (50 percent of observations above and below)
  • Mean: the arithmetic average of a distribution

The mean of a sample (\(\overline{x}\)) is calculated as follows:

\[\overline{x} = \dfrac{x_1 + x_2 + ... + x_n}{n}\]

where \(n\) is the number of elements in the sample.
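The three measures can be compared on a small example. The values below are invented; they include one large value to show how the mean responds to extremes while the median and mode do not.

```python
import pandas as pd

# Made-up values with one extreme observation (41)
values = pd.Series([3, 5, 5, 6, 8, 9, 41])

mean_by_hand = values.sum() / len(values)   # the formula above: 77 / 7
mean = values.mean()                        # 11.0
median = values.median()                    # 6.0
mode = values.mode().iloc[0]                # 5 (mode() returns a Series, since ties are possible)

print(mean_by_hand, mean, median, mode)
```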

Measures of dispersion

  • Range: difference between maximum and minimum values in a distribution
  • Interquartile range: difference between the 75th and 25th percentiles (the third and first quartiles) of a distribution
  • Variance and standard deviation
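The range and interquartile range can be sketched with a few made-up values. pandas computes percentiles with linear interpolation by default, so other software may report slightly different quartiles for small samples.

```python
import pandas as pd

# Made-up values for illustration
values = pd.Series([2, 4, 4, 5, 7, 9, 10, 12])

value_range = values.max() - values.min()   # 12 - 2 = 10

q1 = values.quantile(0.25)   # 25th percentile: 4.0
q3 = values.quantile(0.75)   # 75th percentile: 9.25
iqr = q3 - q1                # 5.25

print(value_range, q1, q3, iqr)
```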

Variance

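  • A measure of the spread of a distribution. The variance is computed as:

\[{\sigma}^2 = \dfrac{\sum\limits_{i=1}^{n}(x_i - \overline{x})^2}{n}\]

or, in simpler terms, the average of the squared deviations of the values from their mean. When the data are a sample drawn from a larger population, the sum is usually divided by \(n - 1\) instead of \(n\); this is what pandas does by default in .var() and .std().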
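The calculation can be traced step by step on a small set of invented values. The sketch also compares the result with NumPy, which divides by n by default, and pandas, which divides by n - 1 by default.

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

deviations = values - values.mean()     # distance of each value from the mean
squared = deviations ** 2               # 9, 1, 1, 1, 0, 0, 4, 16
pop_var = squared.sum() / len(values)   # 32 / 8 = 4.0
pop_sd = np.sqrt(pop_var)               # 2.0

print(pop_var, np.var(values))          # NumPy default divides by n
print(pd.Series(values).var())          # pandas default divides by n - 1 (32 / 7)
print(pd.Series(values).var(ddof=0))    # ddof=0 reproduces the n version
```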

Standard deviation

  • Computed as the square root of the variance; denoted by \(\sigma\).
  • Offers a standardized way to discuss the spread of a distribution. For example, in a normally distributed sample:
    • About 68 percent of the values will be within one standard deviation of the mean
    • About 95 percent of the values will be within two standard deviations of the mean
    • About 99.7 percent of the values will be within three standard deviations of the mean
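These percentages can be checked by simulation. The sketch below draws a large normally distributed sample (the mean of 50, standard deviation of 10, and seed are arbitrary choices) and counts the share of values within one, two, and three standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)                   # arbitrary seed
sample = rng.normal(loc=50, scale=10, size=100_000)   # arbitrary mean and spread

mean = sample.mean()
sd = sample.std()

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return np.mean(np.abs(sample - mean) <= k * sd)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The shares should land close to 0.68, 0.95, and 0.997, with small differences from run to run if the seed is changed.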

Descriptive statistics in pandas

  • Descriptive stats are available in pandas as data frame methods, e.g. grad.mean(), grad.std()
  • Calling .describe() will give you back a number of important descriptive stats at once
grad.describe()
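As an illustration of what .describe() returns, the sketch below uses a hypothetical stand-in for the grad data frame, with invented school and rate columns. By default, only numeric columns are summarized when the data frame mixes text and numbers.

```python
import pandas as pd

# Hypothetical stand-in for grad_rates.csv; values are invented
grad = pd.DataFrame({'school': list('ABCDE'),
                     'rate': [50, 60, 70, 80, 90]})

summary = grad.describe()   # count, mean, std, min, quartiles, max
print(summary)

mean_rate = grad['rate'].mean()   # 70.0
sd_rate = grad['rate'].std()      # sample standard deviation (divides by n - 1)

# With text columns present, restrict data-frame-wide stats to numeric columns
numeric_means = grad.mean(numeric_only=True)
print(mean_rate, sd_rate)
```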

Exploratory visualization

  • Often, when exploring a dataset, you’ll want to use graphical representations of your data to help reveal insights/trends
  • Visualization: Graphical representation of data

Visualization in Python

  • Core visualization package in Python: matplotlib - which comes pre-installed with Anaconda
  • To show matplotlib graphics in your Jupyter Notebook, type %matplotlib inline

  • seaborn: extension to matplotlib to make your graphics look nicer! Seaborn is available from Anaconda but not pre-installed. To install from the command line, type conda install seaborn
  • Standard import: I use import seaborn as sb, the creator uses import seaborn as sns.

Histograms

  • Histogram: graphical representation of a frequency distribution
  • Observations are organized into bins, and plotted with values along the x-axis and the number of observations in each bin along the y-axis
  • Normal distribution: histogram is approximately symmetrical (a “bell curve”)
  • Histograms are built into pandas
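The binning step behind a histogram can be sketched without drawing anything. The values and bin edges below are invented; np.histogram returns the number of observations that fall in each bin, which is what the bar heights represent.

```python
import numpy as np

# Made-up graduation rates for illustration
rates = np.array([52, 55, 61, 64, 67, 68, 71, 73, 78, 85, 91, 97])

# Five bins of width 10 between 50 and 100
edges = [50, 60, 70, 80, 90, 100]
counts, bin_edges = np.histogram(rates, bins=edges)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Changing the number or width of the bins changes the shape of the histogram, so it is worth trying more than one setting.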

Example histogram

%matplotlib inline

import seaborn as sb

grad.rate.hist()

Density plots

  • Smooth representations of your data can be produced with kernel density plots
  • Accessible from both pandas and seaborn
sb.kdeplot(grad.rate, fill = True)  # fill replaces the older shade argument in seaborn 0.11+
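To give a sense of what a kernel density plot computes, here is a rough sketch of a Gaussian kernel density estimate written with NumPy only. The function name, data, and bandwidth of 5 are all illustrative assumptions; seaborn chooses a bandwidth automatically and its implementation differs in detail.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average of Gaussian bumps, one centered on each observation."""
    data = np.asarray(data, dtype=float)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * bandwidth)

# Made-up graduation rates for illustration
rates = np.array([52, 55, 61, 64, 67, 68, 71, 73, 78, 85, 91, 97])
grid = np.linspace(20, 130, 1101)
density = gaussian_kde_sketch(rates, grid, bandwidth=5.0)

# A density curve should enclose an area of about 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely.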

Box plots

  • Also termed “box and whisker plots” - alternative way to show distribution of values graphically
sb.boxplot(x = grad.rate, color = "green")

Anatomy of a box plot

  • Dots beyond the whiskers: outliers
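The outlier rule can be sketched numerically. This assumes the common convention, which is also the matplotlib and seaborn default, that whiskers reach the most extreme observations within 1.5 times the interquartile range of the box; the values are invented.

```python
import numpy as np

# Made-up values with one extreme observation
values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])   # box edges and center line
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()   # where the whiskers end
outliers = values[(values < lower_fence) | (values > upper_fence)]   # drawn as dots

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers)
```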

Violin plots

  • Combinations of box plots and kernel density plots
sb.violinplot(x = grad['rate'], color = 'cyan')